To get some hands-on experience, we apply simple linear regression in an exercise based on the students data set. You may download the students.csv file for exploration purposes or just follow the code lines below. Let us first import the packages we need for this exercise.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import statsmodels.api as sm
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
Alternatively, we may load the data set directly into our workspace using Pandas' read_csv function.
students = pd.read_csv("https://userpage.fu-berlin.de/soga/200/2010_data_sets/students.csv")
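To verify the import, we may briefly inspect the data frame, for instance with the shape attribute and the head() method. This quick check is an optional addition; any columns besides height and weight are simply ignored in this exercise.

print(students.shape)                          # number of rows and columns
print(students[['height', 'weight']].head())   # first rows of the variables of interest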
We consider an example and sample 24 random rows, each representing one student. Note that we set a random state to ensure reproducibility: each time we run the method, the same random rows are sampled. Furthermore, we select the columns with the variables of interest and assign them to the variables x and y. Then we calculate the mean values of these variables using NumPy's mean function.
n = 24
students_sample = students.sample(n, random_state=14)
df = students_sample[['height', 'weight']]
x = df['height']
y = df['weight']
x_bar = np.mean(x)
y_bar = np.mean(y)
To get a first idea of our data, we create a simple scatter plot using the Matplotlib library.
fig = plt.figure(figsize=(10,8))
plt.scatter(x, y)
plt.xlabel(x.name)
plt.ylabel(y.name)
plt.show()
The visual inspection confirms our assumption that the relationship between the height and the weight variable is roughly linear. In other words, with increasing height, the individual student tends to have a higher weight.
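To back this visual impression with a number, we may compute the Pearson correlation coefficient, for instance with NumPy's corrcoef function; this quick check is an addition to the plot above.

r = np.corrcoef(x, y)[0, 1]  # Pearson correlation between height and weight
print(r)

A value close to $1$ indicates a strong positive linear relationship; note that $r^2$ corresponds to the R-squared value reported in the model summary further below.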
As shown in the previous section, we may calculate the parameters $a$ and $b$ of a simple linear model analytically. Recall the equation of a linear model fitted to sample data,

$$y = a + b x + e \text{,}$$

with the slope $b$ given by

$$b = \frac{\sum_{i=1}^n (x_i- \bar x)(y_i-\bar y)}{\sum_{i=1}^n (x_i-\bar x)^2} = \frac{cov(x,y)}{var(x)}\text{,}$$

and the intercept $a$ given by

$$a = \bar y - b \bar x\text{.}$$

Let us try this in Python. We start by calculating $b$.
b = np.sum((x-x_bar)*(y-y_bar))/np.sum((x-x_bar)**2)
print(b)
0.7494309085725562
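Equivalently, since $b = cov(x,y)/var(x)$, we may cross-check the result using NumPy's cov function; note that np.cov and np.var with ddof=1 both use the sample denominator $n-1$, so it cancels in the ratio.

b_check = np.cov(x, y)[0, 1] / np.var(x, ddof=1)  # sample covariance divided by sample variance
print(b_check)

This reproduces the value of $b$ obtained above.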
The slope $b$ is approximately $0.75$. Knowing $b$, we can calculate $a$.
a = y_bar - b*x_bar
print(a)
-55.146696199478015
The intercept $a$ of the regression model is approximately $-55.15$. Thus, we can write down the regression model
$$\text{weight} = a + b \times \text{height}\text{.}$$

Now, based on this equation, we may determine the weight of a student given their height. Here are some sample predictions of the weight of students of heights $156\ cm$, $178\ cm$, and $192\ cm$. Note that we use f-strings, which are only available in Python 3.6 and later.
for height in [156, 178, 192]:
    weight = np.round(a + b * height, 2)
    print(f"weight_{height} = {a} + {b} x {height} cm = {weight} kg")
weight_156 = -55.146696199478015 + 0.7494309085725562 x 156 cm = 61.76 kg
weight_178 = -55.146696199478015 + 0.7494309085725562 x 178 cm = 78.25 kg
weight_192 = -55.146696199478015 + 0.7494309085725562 x 192 cm = 88.74 kg
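The same predictions can also be computed in one vectorized step; a minimal sketch using a NumPy array of heights:

heights = np.array([156, 178, 192])
print(np.round(a + b * heights, 2))  # [61.76 78.25 88.74]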
The statsmodels package provides the class OLS() for ordinary least squares models (see the statsmodels documentation). To get the intercept $a$, we need to provide a column of ones along with the predictor variable $\text{height}$.
x_train = sm.add_constant(x)
model = sm.OLS(y, x_train)
fit = model.fit()
fit.params
const    -55.146696
height     0.749431
dtype: float64
As we can see, the calculations of the parameters match. We get a summary of the model properties using the following line.
fit.summary()
| Dep. Variable: | weight | R-squared: | 0.891 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.886 |
| Method: | Least Squares | F-statistic: | 178.9 |
| Date: | Tue, 20 Dec 2022 | Prob (F-statistic): | 4.81e-12 |
| Time: | 20:58:51 | Log-Likelihood: | -54.440 |
| No. Observations: | 24 | AIC: | 112.9 |
| Df Residuals: | 22 | BIC: | 115.2 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -55.1467 | 9.568 | -5.764 | 0.000 | -74.990 | -35.304 |
| height | 0.7494 | 0.056 | 13.376 | 0.000 | 0.633 | 0.866 |

| Omnibus: | 3.844 | Durbin-Watson: | 2.262 |
|---|---|---|---|
| Prob(Omnibus): | 0.146 | Jarque-Bera (JB): | 2.935 |
| Skew: | 0.855 | Prob(JB): | 0.230 |
| Kurtosis: | 2.900 | Cond. No. | 3.28e+03 |
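As a side note, statsmodels also offers a formula interface (statsmodels.formula.api), which adds the intercept automatically; the following sketch should reproduce the parameter estimates above.

import statsmodels.formula.api as smf

# formula interface: the intercept (const) is added automatically
fit_formula = smf.ols('weight ~ height', data=students_sample).fit()
print(fit_formula.params)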
Our fit object is an instance of the class RegressionResults, which provides many more methods and attributes.
We can get the confidence intervals of our parameters $\beta_0$ (here called const) and $\beta_1$ (here called height) with the following method.
fit.conf_int()
| | 0 | 1 |
|---|---|---|
| const | -74.989570 | -35.303822 |
| height | 0.633237 | 0.865625 |
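By default, conf_int() returns 95% confidence intervals; a different significance level can be requested via the alpha argument, for example for 99% intervals.

print(fit.conf_int(alpha=0.01))  # 99% confidence intervals for const and height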
The Pearson residuals are available through the following attribute.
residuals = fit.resid_pearson
Ideally, the sum of our residuals $(\sum e)$ is close to zero.
print(np.sum(residuals))
9.36584143573782e-13
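The same holds for the raw residuals (observed minus fitted weight, in kg), which are available as fit.resid; with an intercept in the model they sum to zero by construction.

print(np.sum(fit.resid))  # raw residuals also sum to (numerically) zero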
This is the case and strengthens our confidence in the model. Let us check for heteroscedasticity in the residuals. To do so, we plot the residuals and inspect them visually. There is a neat function residplot() in the seaborn library.
fig = plt.figure(figsize=(8, 6))
predicted = fit.predict(x_train)
sns.residplot(x=predicted, y=residuals)
plt.xlabel("predicted")
plt.ylabel("residuals")
plt.show()
The residuals spread independently of each other around the line. We do not see a sign of heteroscedasticity in the data.
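Besides the visual inspection, we may also run a formal test for heteroscedasticity, for example the Breusch-Pagan test provided by statsmodels.stats.diagnostic. This test is an addition to the exercise; a large p-value is consistent with homoscedastic residuals.

from statsmodels.stats.diagnostic import het_breuschpagan

# Breusch-Pagan test: H0 states that the residuals are homoscedastic
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(fit.resid, x_train)
print(lm_pvalue)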
Finally, we visualize our model. Again, we use the seaborn library.
We set a few options to get the plot we want: ci controls the confidence interval, and the value 95 gives a 95% interval. Setting truncate to False lets the regression line extend beyond the data range to the axis limits. The ax argument places the plot on a specific matplotlib Axes object, and label defines the string used in the legend.
For the matplotlib plots, we specify colors, markers, and labels for the legend.
get_prediction() is another particularly interesting method. It returns the predicted value $\hat y_i$ for each particular $x_i$ of the training data.
Additionally, we may provide new data to predict $\hat y_i$ for any given $x_i$. Be aware that the new data must be accompanied by a column of constants. We add a column of ones.
And we add the so-called prediction bands. These include the uncertainty about future observations: they capture the majority of the observed points and do not collapse to a line as the number of observations increases. They are available as the columns obs_ci_upper and obs_ci_lower of the data frame returned by the summary_frame() method of the prediction object. Since these values are returned in the (unsorted) order of the observations, we sort them by height before drawing the prediction bands as line plots.
new_data = sm.add_constant([165, 172, 183])
pred = fit.get_prediction()
pred_new = fit.get_prediction(new_data)
fig, ax = plt.subplots(figsize=(8, 6))
sns.regplot(x=x, y=y, ci=95, truncate=False, ax=ax, label="data")
ax.scatter(x, pred.predicted_mean, color='red', marker="^", label='predicted values')
ax.scatter(new_data[:, 1], pred_new.predicted_mean, color='orange', marker="s", label="new data (predicted)")
# sort the observations by height so that the prediction bands are drawn as continuous lines
order = np.argsort(x.values)
ax.plot(x.values[order], pred.summary_frame()["obs_ci_upper"].values[order], color="k", label="prediction bands")
ax.plot(x.values[order], pred.summary_frame()["obs_ci_lower"].values[order], color="k")
ax.legend()
plt.show()
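For the three new heights, the predicted means together with their confidence and prediction intervals can also be inspected numerically via the summary_frame() method of the prediction object.

print(pred_new.summary_frame())  # mean, mean_ci_lower/upper and obs_ci_lower/upper for 165, 172 and 183 cm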
Have another look at the model summary. We will discuss some of these parameters in the next section, where we take a look at model diagnostics.
fit.summary()
| Dep. Variable: | weight | R-squared: | 0.891 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.886 |
| Method: | Least Squares | F-statistic: | 178.9 |
| Date: | Tue, 20 Dec 2022 | Prob (F-statistic): | 4.81e-12 |
| Time: | 20:58:15 | Log-Likelihood: | -54.440 |
| No. Observations: | 24 | AIC: | 112.9 |
| Df Residuals: | 22 | BIC: | 115.2 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -55.1467 | 9.568 | -5.764 | 0.000 | -74.990 | -35.304 |
| height | 0.7494 | 0.056 | 13.376 | 0.000 | 0.633 | 0.866 |

| Omnibus: | 3.844 | Durbin-Watson: | 2.262 |
|---|---|---|---|
| Prob(Omnibus): | 0.146 | Jarque-Bera (JB): | 2.935 |
| Skew: | 0.855 | Prob(JB): | 0.230 |
| Kurtosis: | 2.900 | Cond. No. | 3.28e+03 |